In this project we will explore data from Prosper loans that contains information on 113937 loans made through Prosper between 2005 and 2014. For the purposes of this project we will imagine that we are a competitor to Prosper and will attempt to use these data to predict the interest rate that a potential borrower would be offered by Prosper so that we can offer a slightly cheaper rate. (for this reason we ignore the ProsperRating variable, which would otherwise be very useful in predicting interest rates)
According to Wikipedia, Prosper is America’s first peer-to-peer lender in which borrowers request funds and other individuals and institutions can fund such loan requests. Although Prosper loans were initially auctioned, since July 2009 rates are set by Prosper’s model and lenders can choose whether or not they wish to fund the loan. Prosper determines its rates using a traditional credit score and its own proprietary Prosper score based on their own historical data.
Now let’s look at some of our variables. First let’s get an overview of Prosper’s activity since it’s inception. One difficulty of peer-to-peer lending is regulation of lending activities tends to be strict at times. Here is a history of Prosper’s monthly loan volume since its inception in 2005. As can be seen on the chart Prosper made no loans for a brief period from the end of 2008 to the middle of 2009 and in more recent months, loan volumes have been rising fast.
The dependent variable we wish to model is BorrowerRate. We are trying to predict the rate that a borrower would be offered on the Prosper website. Also, since Prosper changed how it set the rates for its loans we will create a NewRegime variable that is TRUE if the loan has a ProsperRating, the first of which occurs on July 20, 2009.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1334 0.1830 0.1924 0.2500 0.4975
The data appear to be more normally distributed when square root transformed. The older data contain a number of very low interest rates (including 0) which are not found in the new data. In either event we note a spike in rates at around 35%. We need to investigate this spike since the presence of these outlier values will surely affect the predictions of our linear model.
Below we plot a daily count of all loans greater at rates above 32%. It appears to show spikes and crashes, the first of which seems to coincide with the financial crisis of 2008.
##were they all around the same date?
ggplot(aes(x = LoanOriginationDate), data = subset(ld, BorrowerRate >= .32)) +
geom_histogram(binwidth = 10) +
ggtitle("Count of loans with interest rate greater than 30%")
Let’s look more closely at the loans in the histogram spike. Are they at very specific rates or are there just a lot of loans in that range?
ggplot(aes(x = BorrowerRate), data = subset(ld, BorrowerRate >= .31 &
BorrowerRate < .32)) +
geom_histogram(binwidth= .0001)
Looks like there are a lot of loans at two specific rates. Hard to see exactly what they are in the graph, so we print counts for all of the rates in that area.
w <- seq(.3175,.32, by = .0001)
for (i in w)
{
print(i)
print(length(ld$BorrowerRate[ld$BorrowerRate == round(i, digits = 4)]))
}
## [1] 0.3175
## [1] 8
## [1] 0.3176
## [1] 2
## [1] 0.3177
## [1] 3332
## [1] 0.3178
## [1] 1
## [1] 0.3179
## [1] 2
## [1] 0.318
## [1] 1
## [1] 0.3181
## [1] 0
## [1] 0.3182
## [1] 0
## [1] 0.3183
## [1] 0
## [1] 0.3184
## [1] 0
## [1] 0.3185
## [1] 27
## [1] 0.3186
## [1] 0
## [1] 0.3187
## [1] 0
## [1] 0.3188
## [1] 4
## [1] 0.3189
## [1] 2
## [1] 0.319
## [1] 4
## [1] 0.3191
## [1] 0
## [1] 0.3192
## [1] 0
## [1] 0.3193
## [1] 0
## [1] 0.3194
## [1] 0
## [1] 0.3195
## [1] 10
## [1] 0.3196
## [1] 0
## [1] 0.3197
## [1] 1
## [1] 0.3198
## [1] 3
## [1] 0.3199
## [1] 1649
## [1] 0.32
## [1] 76
We find 1651 loans with an interest rate of exactly 31.99% and 3672 loans with an interest rate of 31.77%. We will look more closely at these loans to see if we can figure out why there are so many at these rates.
Although there is some overlap, it appears that the 31.99% rates were offered in 2011 and the 31.77% in 2012. Is there an equivalent maximum for loans after that date that we have missed? Let’s get a histogram of rates for loans in 2013 and after.
ggplot(aes(x = BorrowerRate), data = subset(ld, LoanOriginationDate >=
as.Date("2013/01/01"))) +
coord_cartesian(xlim = c(.3, .33)) +
geom_histogram(binwidth = .0001)
## Warning: position_stack requires constant width: output may be incorrect
#what's the exact rate?
ld2013 <- subset(ld, LoanOriginationDate >= as.Date("2013/01/01"))
w <- seq(.313,.314, by = .0001)
for (i in w)
{
print(i)
print(length(ld2013$BorrowerRate[ld2013$BorrowerRate == i]))
}
## [1] 0.313
## [1] 0
## [1] 0.3131
## [1] 0
## [1] 0.3132
## [1] 0
## [1] 0.3133
## [1] 0
## [1] 0.3134
## [1] 718
## [1] 0.3135
## [1] 0
## [1] 0.3136
## [1] 0
## [1] 0.3137
## [1] 0
## [1] 0.3138
## [1] 0
## [1] 0.3139
## [1] 0
## [1] 0.314
## [1] 0
For loans 2013 and beyond we find that there are many loans (718) at 31.34%.
We now expect to find a similar rate ceiling in the data between the start of the new rate setting method in December 2009 and the beginning of 2011 when the ceiling appears to have become 31.99%.
pre2011 <- subset(ld, LoanOriginationDate <= as.Date("2011/01/01")
& NewRegime == TRUE)
ggplot(aes(x = BorrowerRate), data = pre2011) +
coord_cartesian(xlim = c(.345, .355)) +
geom_histogram(binwidth = .0001)
## Warning: position_stack requires constant width: output may be incorrect
#what's the exact rate where we find the spike?
w <- seq(.3495,.3502, by = .0001)
for (i in w)
{
print(i)
print(length(pre2011$BorrowerRate[pre2011$BorrowerRate == i]))
}
## [1] 0.3495
## [1] 15
## [1] 0.3496
## [1] 0
## [1] 0.3497
## [1] 0
## [1] 0.3498
## [1] 3
## [1] 0.3499
## [1] 8
## [1] 0.35
## [1] 800
## [1] 0.3501
## [1] 0
## [1] 0.3502
## [1] 0
For loans between December 2009 and January 2011 we find a spike in the histogram of rates at 35.00% with 800 loans.
We create a dataset containing only those loans where BorrowerRate is equal to one of the ceilings we have identified above, and then plot counts of those loans across time. We see that those rates in fact appear to be very limited to specific time windows.
ldHighRates <- subset(ld, BorrowerRate == .3177 | BorrowerRate == .3199 |
BorrowerRate == .35 | BorrowerRate == .3134)
ldHighRates$BorrowerRate <- factor(ldHighRates$BorrowerRate)
ggplot(aes(x = LoanOriginationDate, fill = BorrowerRate), data = ldHighRates) +
coord_cartesian(xlim = c(as.Date("2009/07/20"), as.Date("2014/03/01"))) +
geom_histogram(color = "black", binwidth = 50)
Now we create a new variable IsCeiling for variables in the new regime. It is a Boolean that indicates if the loan was made at the prevailing ceiling rate that we have identified. Note that only a fraction of total loans were made at this ceiling rate (7.6%).
#create the isCeiling variable to indicate the loan was
#made at the prevailing rate ceiling
ld$IsCeiling <-
(ld$BorrowerRate == .35 & ld$LoanOriginationDate < as.Date("2011/01/01") &
ld$LoanOriginationDate > as.Date("2009/07/20")) |
(ld$BorrowerRate == .3199 & ld$LoanOriginationDate < as.Date("2012/01/01") &
ld$LoanOriginationDate >= as.Date("2011/01/01")) |
(ld$BorrowerRate == .3177 & ld$LoanOriginationDate < as.Date("2013/01/01") &
ld$LoanOriginationDate >= as.Date("2012/01/01")) |
(ld$BorrowerRate == .3134 & ld$LoanOriginationDate >= as.Date("2013/01/01"))
So it appears that the spike we see in the histogram of all rates around 34% is due to a cap on the rates that Prosper offers to its borrowers (or at least most of them). That cap changes about once a year, but it remains unclear if it is due to regulatory factors or internal Prosper decisions. According to this article up until April 15 2008 the maximum rate Prosper could offer a lender was determined by regulations in the borrowers state, but after that point there was a cap of 36% for all states except Texas. This suggests that the rate ceilings we are finding in the data are set internally by Prosper.
Since we are ultimately interested in predicting rates offered by Prosper, from this point forward we will mainly be interested in loans made since December 19 2009 under the new rate setting mechanism. We subset the data accordingly before analyzing independent variables.
#create subset of the dataframe for loans that are TRUE for NewRegime
ldNew <- subset(ld, NewRegime == TRUE)
Now we look at our independent variables, many of which are ranked. We expect that any factor that increases the perceived probabilty of loan default will correlate with higher interest rates.
LoanOriginalAmountsummary(ldNew$LoanOriginalAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9114 14000 35000
p1 <- ggplot(aes(x = LoanOriginalAmount), data = ldNew) +
geom_histogram(binwidth = 500) +
coord_cartesian(xlim = c(500,26000)) +
ggtitle("Histogram of Loan Amounts")
p2 <- ggplot(aes(x = sqrt(LoanOriginalAmount)), data = ldNew) +
geom_histogram(binwidth = 10) +
ggtitle("Histogram of Square Root of Loan Amounts")
grid.arrange(p1, p2, ncol = 1)
Here we see that most loans are multiples of $5000. The bulk of the loans appear to be in the $5000 range with progressively fewer of the larger loans. A square root transformation of the data yield a more normal distribution but it is still lumpy.
ggplot(aes(x = Term), data = ldNew) +
geom_histogram()
summary(ldNew$Term)
## 12 36 60
## 1561 57664 24070
IncomeRange - we expect that individuals with higher incomes are likely to pay lower rates. Most of the borrowers have yearly incomes of at least $25000.summary(ldNew$IncomeRange)
## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 656 4515 23726 25212 14255
## $100,000+
## 14931
ggplot(aes(x = IncomeRange), data = ldNew) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_histogram()
CreditScore - this number is obtained from a traditional credit report. The untransformed data is positively skewed, but looks more normal when square root transformed.summary(ldNew$CreditScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.4 720.0 880.0
ggplot(aes(x = CreditScore), data = ldNew) +
geom_histogram(binwidth = 10)
ggplot(aes(x = sqrt(CreditScore)), data = ldNew) +
geom_histogram(binwidth = .1)
## Warning: position_stack requires constant width: output may be incorrect
IsBorrowerHomeowner - This variable simply indicates whether or not the borrower owns a home. A small majority of borrowers are homeowners.summary(ldNew$IsBorrowerHomeowner)
## False True
## 39291 44004
ggplot(aes(x = IsBorrowerHomeowner), data = ldNew) +
geom_histogram()
ListingCategory - This indicates the stated purpose of the loan. Most lenders are seeking to consolidate other loans. This is even true for those that are taking high interest loans which seems odd since consolidation loans are usually for the purpose of lowering rates.labels = c("Not Available", "Debt Consolidation", "Home Improvement",
"Business", "Personal Loan", "Student Use", "Auto","Other",
"Baby & Adoption", "Boat", "Cosmetic Procedure", "Engagement Ring",
"Green Loans", "Household Expenses", "Large Purchases",
"Medical/Dental", "Motorcycle", "RV", "Taxes", "Vacation",
"Wedding Loans")
p1 <- ggplot(aes(x = ListingCategory), data = ld) +
geom_histogram() +
scale_x_discrete(labels = labels) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Reason for Loan")
p2 <- ggplot(aes(x = ListingCategory), data = subset(ld, BorrowerRate > .32)) +
geom_histogram() +
scale_x_discrete(labels = labels) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Reason for Loan - rates over 32%")
grid.arrange(p1,p2, ncol = 1)
The dataset includes information from 110079 loans measured in 31 variables. We have continuous variables as well as categories and ranked categories.
As noted, the main feature of interest is the BorrowerRate. We would like to model this rate so that we can guess at the rates a borrower would receive from Prosper.
We will investigate LoanOriginationAmount', 'Term', 'CreditScore', 'IncomeRange', 'IsBorrowerHomeowner',ListingCategory’ and Month. We also add data on 3-month T-bill rates (TB3MS) and credit card rates (CCRate) obtained from the FederalReserve.
We created a Month variable to determine if interest rates changed according to the time of the year. We reasoned that there may be times of the year when borrowers are more desperate and willing to accept higher rates.
We also created a NewRegime variable, a Boolean indicating whether or not the loan was offered after the change in Prosper’s methodology. We treat any loan with a ProsperRating as part of the new regime.
As noted above, we created an IsCeiling Boolean that indicates if the loan was made at the prevailing Prosper rate ceiling.
We also created two variables from data obtained outside the dataset. Both to indicate prevailing interest rates at the time of the loan. They are TB3MS to indicate the prevailing interest rates on US 3-month Treasury Bills and CCRate to indicate the prevailing credit card interest rates at the time of the loans. Both these datasets were obtained from the Federal Reserve.
BorrowerRate, LoanOriginalAmount and CreditScore all had long-tailed distributions and so were square root transformed so that the distribution would be more normal. The data were adjusted so that dates would be treated as such by R. We also made sure that variables were of the appropriate data type and that ranked variables were ordered correctly.
The BorrowerRate variable was of particular concern due to a spike in loans at around 35%. This spike represents a challenge for linear modeling of the data. Our investigation suggests that the spike is due to a Prosper internal rate ceiling.
We now want to look at the relationship between our dependent variable (BorrowerRate) and the dependent variables we have identified above.
CreditScoreresult <- as.character(round(cor(ldNew$CreditScore, (ldNew$BorrowerRate)),
digits = 4))
#sqrt(BorrowerRate) vs. CreditScore
ggplot(aes(x = CreditScore, y = BorrowerRate), data = ldNew) +
geom_point(alpha = .02, position = "jitter") +
stat_smooth(method = "lm") +
annotate("text", x = 860, y = .4, label = paste("r = ", result))
CreditScore and BorrowerRate, though we also note that the data are very noisy. A Pearson’s test reveals a negative correlation between CreditScore and BorrowerRate (r = -0.511) and so is a good candidate for inclusion in our model.Since CreditScore has such a strong relationship with BorrowerRate we will investigate the relationship between CreditScore and some of its components to determine which seem to most strongly effect it.
DelinquenciesLast7Years - Most appear to have no delinquencies on their record, but for those that do, it affects their CreditScore negatively. We see that the upper right corner of the graph is empty meaning that those with delinquencies almost never have high credit scores. Alternatively, no delinquencies is not a guarantee of a good credit score.#which are the most important components of CreditScore
ggplot(aes(x = DelinquenciesLast7Years), data = ldNew) +
geom_histogram(binwidth = 1) +
xlim(c(0,40))
ggplot(aes(x = DelinquenciesLast7Years, y = CreditScore),
data = sample_n(ldNew, 5000)) +
geom_point(alpha = .2, position = "jitter")
cor(ldNew$CreditScore, ldNew$DelinquenciesLast7Years)
## [1] -0.2177294
DebtToIncomeRatio - Surprisingly, DebtToIncomeRatio does not have a strong relationship with CreditScore. Some borrowers with debts 10 times thier income still have credit scores over 800.ggplot(aes(DebtToIncomeRatio), data = ldNew) +
geom_histogram(binwidth = .01) +
xlim(c(0,2))
## Warning: position_stack requires constant width: output may be incorrect
ggplot(aes(x = DebtToIncomeRatio, y = CreditScore), data = ldNew) +
geom_point(alpha = .1)
## Warning: Removed 7107 rows containing missing values (geom_point).
ggplot(aes(x = DebtToIncomeRatio, y = CreditScore),
data = sample_n(ldNew, 10000)) +
geom_point(alpha = .2)
## Warning: Removed 845 rows containing missing values (geom_point).
cor(ldNew$CreditScore[!is.na(ldNew$DebtToIncomeRatio)],
ldNew$DebtToIncomeRatio[!is.na(ldNew$DebtToIncomeRatio)])
## [1] -0.01297855
AmountDelinquent - the fact of a delinquency seems to be more important than the amount when computing a CreditScore. As can be seen below, the relationship between CreditScore and AmountDelinquent is weak.ggplot(aes(x = log(AmountDelinquent)), data = ldNew) +
geom_histogram(binwidth = .1) +
xlim(c(0, 20))
## Warning: position_stack requires constant width: output may be incorrect
ggplot(aes(x = AmountDelinquent, y = CreditScore),
data = sample_n(ldNew, 10000)) +
geom_point(alpha = .5, position = "jitter") +
xlim(c(0,5000))
## Warning: Removed 4635 rows containing missing values (geom_point).
cor(ldNew$AmountDelinquent, ldNew$CreditScore)
## [1] -0.0487577
TotalInquiries - We find a weak negative correlation between the number of inquiries on an individual’s credit report and their CreditScore.ggplot(aes(x = TotalInquiries), data = ldNew) +
geom_histogram(binwidth = 1)
ggplot(aes(x = TotalInquiries, y = CreditScore), data = sample_n(ldNew, 5000)) +
geom_point(position = "jitter", alpha = .2) +
xlim(c(0,20))
## Warning: Removed 271 rows containing missing values (geom_point).
cor(ldNew$CreditScore, ldNew$TotalInquiries)
## [1] -0.1451249
InquiriesLast6Months - There is also a weak correlation between CreditScore and the number of recent inquiries.ggplot(aes(x = InquiriesLast6Months), data = ldNew) +
geom_histogram(binwidth = 1)
ggplot(aes(x = InquiriesLast6Months, y = CreditScore),
data = sample_n(ldNew, 20000)) +
geom_point(alpha = .04, position = "jitter") +
xlim(c(0,6))
## Warning: Removed 5233 rows containing missing values (geom_point).
cor(ldNew$CreditScore, ldNew$InquiriesLast6Months)
## [1] -0.08038168
DaySinceFirstCredit - We create this column using the difference between DaysSinceFirstCredit and Date. We find a small correlation between this variable and CreditScore.ldNew$DaysSinceFirstCredit <-
as.numeric(ldNew$Date - ldNew$FirstRecordedCreditLine)
ggplot(aes(x = DaysSinceFirstCredit), data = ldNew) +
geom_histogram(binwidth = 100)
cor(ldNew$DaysSinceFirstCredit, ldNew$CreditScore)
## [1] 0.1101641
EmploymentStatusDuration - We transform this variable with square root to help normalize the distribution, but find almost no correlation between it and CreditScore.ggplot(aes(x = sqrt(EmploymentStatusDuration)), data = ldNew) +
geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(aes(x = EmploymentStatusDuration, y = CreditScore),
data = sample_n(ldNew, 10000)) +
geom_point(alpha = .1, position = "jitter")
## Warning: Removed 2 rows containing missing values (geom_point).
cor(sqrt(ldNew$EmploymentStatusDuration[!is.na(ldNew$EmploymentStatusDuration)])
,ldNew$CreditScore[!is.na(ldNew$EmploymentStatusDuration)])
## [1] 0.03030434
RevolvingCreditBalance - The relationship between RevolvingCreditBalance and CreditScore is not straight forward. Individuals with high credit balances are clustered in the middle range of credit scores. This makes some sense since indivudals with low credit scores are unlikely to have the opportunity to take on much debt and individuals that do take on a lot of debt are likely to see their credit scores go lower. It would be interesting to see historical data to see if the individuals with very high RevolvingCreditBalance previously had much higher credit scores.ggplot(aes(x = log(RevolvingCreditBalance)), data = ldNew) +
geom_histogram(binwidth = .1)
ggplot(aes(x = RevolvingCreditBalance, y = CreditScore),
data = sample_n(ldNew, 20000)) +
geom_point(alpha = .1, position = "jitter") +
xlim(c(0,50000))
## Warning: Removed 1628 rows containing missing values (geom_point).
cor(ldNew$CreditScore, ldNew$RevolvingCreditBalance)
## [1] 0.05663326
BankcardUtilization - we find a relatively strong relationship between this variable and CreditScore. This makes sense since individuals using all of their avaialble credit are not good candidates for additional loans.ggplot(aes(x = BankcardUtilization), data = ldNew) +
geom_histogram(binwidth = .01) +
xlim(c(0,1.3))
## Warning: position_stack requires constant width: output may be incorrect
ggplot(aes(x = BankcardUtilization, y = CreditScore),
data = sample_n(ldNew, 10000)) +
geom_point(position = "jitter", alpha = .1) +
xlim(c(0,1)) +
stat_smooth(method = "lm")
## Warning: Removed 63 rows containing missing values (stat_smooth).
## Warning: Removed 344 rows containing missing values (geom_point).
cor(ldNew$BankcardUtilization, ldNew$CreditScore)
## [1] -0.4454104
IncomeRangeggplot(aes(x = IncomeRange, y = sqrt(BorrowerRate)), data = ldNew) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_boxplot()
#compare credit score with income range
ggplot(aes(x = IncomeRange, y = CreditScore), data = ldNew) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_boxplot()
we would expect that borrowers with higher incomes are less risky and could therefore get better rates. Using Pearson’s test, we calculate a negative correlation between IncomeRange and BorrowerRate (= -0.251). So IncomeRange is also likely to be a good predictor variable for our model.
DebtToIncomeRatio
ggplot(aes(x = sqrt(DebtToIncomeRatio), y = sqrt(BorrowerRate)),
data = sample_n(ldNew, 5000)) +
geom_point(alpha = .1) +
xlim(c(0,1)) +
stat_smooth(method = "lm")
## Warning: Removed 476 rows containing missing values (stat_smooth).
## Warning: Removed 476 rows containing missing values (geom_point).
cor(ldNew$BorrowerRate[!is.na(ldNew$DebtToIncomeRatio)],
ldNew$DebtToIncomeRatio[!is.na(ldNew$DebtToIncomeRatio)])
## [1] 0.1270731
As expected, DebtToIncomeRatio is positively correlated with BorrowerRate, but the relationship is not particularly strong(r = 0.1271)
Termggplot(aes(x = Term, y = sqrt(BorrowerRate)), data = ldNew) +
geom_boxplot()
by(ldNew$BorrowerRate, ldNew$Term, summary)
## ldNew$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1508 0.2064 0.2669
## --------------------------------------------------------
## ldNew$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1274 0.1899 0.1980 0.2699 0.3600
## --------------------------------------------------------
## ldNew$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1486 0.1852 0.1923 0.2310 0.3304
It appears that rates for the 12-month loans are slightly better that for those with longer terms. However, there appears to be very little correlation between BorrowerRate and Term (r = 0.0277).
LoanOriginalAmount -result <- round(cor(sqrt(ldNew$LoanOriginalAmount),
sqrt(ldNew$BorrowerRate)), digits = 3)
ggplot(aes(x = sqrt(LoanOriginalAmount), y = sqrt(BorrowerRate)),
data = ldNew) +
geom_point(alpha = .05) +
stat_smooth(method = "lm") +
annotate("text", x = 170, y = .58, label = paste("r = ", result))
We find a negative correlation between LoanOriginalAmount and BorrowerRate (r = -0.402). This may be due to the fact that the administrative costs of servicing, likely to be relatively uniform for all loans, make up a larger portion of the cost of a loan for small loans.
IsBorrowerHomeowner -ggplot(aes(x = IsBorrowerHomeowner, y = BorrowerRate), data = ldNew) +
geom_boxplot()
wilcox.test(ld$BorrowerRate~ld$IsBorrowerHomeowner)
##
## Wilcoxon rank sum test with continuity correction
##
## data: ld$BorrowerRate by ld$IsBorrowerHomeowner
## W = 1760263974, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
We use a Mann-Whitney-Wilcoxon test to determine if the two populations (homeowners and non-homeowners) are statistically different from one another. The test reveals a very low probability that the two populations come from the same distribution (W = 1760263974, p < 2.2e-16) so we conclude that being a homeowner helps one borrow at a lower rate.
ListingCategoryggplot(aes(x = ListingCategory, y = BorrowerRate), data = ldNew) +
scale_x_discrete(labels = labels) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 5, size = 2)
#ANOVA
fit <- aov(BorrowerRate ~ ListingCategory, data = ldNew)
ftest <- summary(fit)[[1]]
fvalue <- ftest$"F value"[1]
prob <- round(ftest$"Pr(>F)"[1])
The diamonds in the plot above indicate the mean Borrower rate for a given category.
Here we see that loans for which the listing category is “Not Available” tend to have higher rates. This is unsurprising, but it does seem odd that loans for category “Other” are not similarly penalized. Loans for “Baby & Adoption” seem to have relatively low rates, most likely because only more worthy borrowers are seeking loans of this type. It also appears that numerous individuals have taken very high interest loans in order to purchase engagement rings. An ANOVA reveals that there are significant differences in the loans offered for different categories (F = 106.6026, p < 2e-16).
We would expect that the rates on offer on Prosper would correlate with prevailing interest rates for other types of loans. Since most of the loans are for a period of 36 months, we will add a variable to our dataset that indicates the prevailng interest rate for these bonds at the time the Prosper loan was made.
p1 <- ggplot(aes(x = LoanOriginationDate, y = CCRate), data = ldNew) +
geom_line()
p2 <- ggplot(aes(x = LoanOriginationDate, y = TB3MS), data = ldNew) +
geom_line()
grid.arrange(p1,p2, ncol = 1)
ggplot(aes(x = CCRate, y = BorrowerRate), data = sample_n(ldNew, 10000)) +
geom_point(alpha = .1, position = "jitter")
ggplot(aes(x = TB3MS, y = BorrowerRate), data = sample_n(ldNew,10000)) +
geom_point(position = "jitter", alpha = .1)
paste("Correlation between BorrowerRate and TB3MS = ",
round(cor(ldNew$BorrowerRate, ldNew$TB3MS), digits = 4))
## [1] "Correlation between BorrowerRate and TB3MS = 0.0817"
paste("Correlation between BorrowerRate and CCRate = ",
round(cor(ldNew$BorrowerRate, ldNew$CCRate), digits = 4))
## [1] "Correlation between BorrowerRate and CCRate = 0.0333"
In both cases we find only tiny positive correlations between BorrowerRate and the other rates. We also note that in most cases the Prosper rates are much higher than prevailing credit card interest rates. This suggests that people who borrow on Prosper are unable to obtain funding through more traditional channels.
It may be that rates on Prosper are higher or lower at different times of the year. To check that we create boxplots of BorrowerRate showing the rates in each of the different months of the year.
ggplot(aes(x = Month, y = BorrowerRate), data = ldNew) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
result <- summary(aov(BorrowerRate ~ Month, data = ldNew))
f_value <- unclass(result)[[1]]$"F value"[1]
Diamonds in the plot above indicate mean values.
It does appear to be the case that rates are higher in the summer months (F = 227.6630385, p < 2e-16). It also seems to the be the case that in the beginning of the year BorrowerRate is negaitively skewed and at the end of the year it is positively skewed with a relatively small number of high interest loans that bring up the average.
As noted above we find negative correlations between BorrowerRate and CreditScore, IncomeRange and LoanOriginalAmount. We fail to find expected correlation between BorrowerRate and CCRate. We find the length of a loan to be a poor predictor of the rate for that loan.
It was surprising to see that CreditScore and IncomeRange did not correlate very strongly. (r = 0.1461226)
Unsuprisingly, the strongest relationship that we found was between BorrowerRate and CreditScore.
ggplot(aes(x = IncomeRange, y = BorrowerRate, fill = IsBorrowerHomeowner),
data = ldNew) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_boxplot()
At each income range, homeowners tend to pay lower rates.
ggplot(aes(x = IncomeRange, y = BorrowerRate, fill = IncomeVerifiable),
data = ldNew) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_boxplot()
Those that are unable to provide proof of their stated income pay higher rates. This appears to be true even for those that have no income, but the divergence is especially significant at the higher income ranges.
ggplot(aes(x = CreditScore, y = BorrowerRate, color = IsBorrowerHomeowner),
data = sample_n(ldNew, 1000)) +
geom_point(alpha = .8, position = "jitter") +
coord_cartesian(xlim = c(550, 900))
The lowest rates are payed by homeowners with good credit.
ldNew$Income2 <- factor(ntile(ldNew$IncomeRange, 3))
ggplot(aes(x = CreditScore, y = BorrowerRate, color = Income2),
data = sample_n(ldNew, 500)) +
geom_point(alpha = 1, position = "jitter") +
coord_cartesian(xlim = c(550, 900))
High income individuals cluster near the bottom of the chart - they get the best rates. The lower incomes are at the top, while middle income individuals are scattered all over the place.
ggplot(aes(x = BankcardUtilization, y = BorrowerRate, color = Income2),
data = sample_n(ldNew, 500)) +
geom_point(position = "jitter") +
xlim(c(0,1)) +
stat_smooth(method = "lm")
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).
For each income range BorrowerRate increases with BankcardUtilization.
At every income level, BorrowerRate tends to be lower for individuals that own homes. IncomeRange is a more powerful predictor when IncomeVerifiable is TRUE. We also find that the lowest rates are, not surprisingly, payed by borrowers with hihg incomes and high credit scores. Low income individuals tend to have lower credit scores and to pay higher rates. Data on middle income individuals are particularly noisy so rates for these individuals are likely dependent on other factors.
It appears that BankcardUtilization is weakly correlated with BorrowerRate for low income individuals, but more strongly correlated with BorrowerRate for middle and high income individuals.
#a pictorial history of ProsperLoans
#start with a histogram of loans per day over time
#overlay important dates in history
d<- data.frame(date=as.Date(c("2006/02/05", "2008/04/15", "2008/11/2008",
"2009/07/15", "2012/03/01", "2013/07/19")),
event = c("Prosper opens for business",
"Rate cap becomes uniform 36%",
"SEC shuts Prosper down", "Prosper website reopens",
"Prosper allows IRA investment",
"Prosper settles lawsuit"))
#make LoanMonth Date by giving every entry 01 as the dat
ld$LoanMonth <- (as.Date(paste(ld$LoanMonth, "-01",sep = ""), "%Y-%m-%d"))
#create a dataframe with monthly mean BorrowerRate and CCRate
monthly_average <- ld %>%
group_by(LoanMonth) %>%
summarise(Prosper = mean(BorrowerRate), CreditCard = mean(CCRate))
#gather into 3 columns - LoanMonth, Rate, Type (cc or prosper)
Rates <- monthly_average %>%
gather(Type, Loan, Prosper:CreditCard)
We note that in recent years and months there has been both a steady increase in the number of loans made by Prosper and a steadt decrease in the rates charged for those loans. This suggests that older loan data may not be particularly useful for understanding Prosper’s current methodology for determining lending rates.
Secondly, we see that Prosper’s rate does not apear to be correlated with credit cards rates, and that thus far Prosper’s rate has been much more volatile. This volatility is probably due to the novel nature of Prosper’s lending methodology, but may alse be due to fluctuations in the supply of loans on Prosper. It might be that as Prosper’s rates continue to go down they may be linked more tightly with other consumption loan rates like those of Prosper.
Finally, taken together these plots suggest that Prosper’s future is bright since it is attracting the more reliable borrowers that it needs to maintain its model.
This plot shows a sample of loans since Prosper reopened on July 15, 2009. It shows BorrowerRate against its 3 best predictors - credit score, income and Loan amount. We see that nearly all of the large loans are to high income borrowers, that low income borrowers tend to pay very high rates and that although average rates on Prosper may be high, some borrowers are getting very attractive rates compared to other comsumption loans. We note also though that high income and a good credit score do not guarantee a good loan as there are some blue circles in the top right of the chart.
The plot above shows the frequency of loans at different rates to borrowers with different credit scores. We see little overlap between the highest and lowest interest rates as the highest rates go to those with the best credit scores and the lowest rate to those with the worst. This plot helps to counteract the impression given by Plot 1 that Prosper borrowers will always do worse than credit card borrowers. In fact we see that many borrowers, especially those with good credit, are able to take loans at rates much more attractive than those typically offered by credit card companies.
In this project we explored the possibility of predicting the rate that Prosper Loans would offer its borrowers based on simple information available from borrowers and their credit reports. We were unable, in this preliminary investigation of the data, to develop a model that would reliably accomplish that task. Nevertheless, important information has been developed that can aid us in the future. We were able to determine for example, that Prosper appears to place an internal ceiling on the rates it offers and that that ceiling seems to change at the beginning of the year. We also found that although aggregate interest rates at Prosper are quite high compared to other comsumption loans, there are also many loans given at much more attractive rates for more qualified borrowers. A surprise finding was the lack of a link between Prosper rates and other prevailing interest rates.
The decision to subset the data to those loans that were made since Prosper reopened with a new methodology, although a sensible one, may have resulted in useful data being discarded. Similarly, the decision to model only a fraction of the available variables allowed for more in depth analysis of those variables, but also deprived the model of potentially crticial information.
The most challenging aspect of this analysis was the large number of data points that it contained (over 100000 different loans in over 80 categories). This made two things very difficult in particular. Because there were so many loans it was difficult to visualize them all together so it was necessary to take samples of the data for visualization and also to use changes in the position and opacity of points on the charts. The large number of variables, and the mixture of continuous, categorical and ordered data made it difficult to determine which variables were the most important to determining rates.
The most important success we had was identifying the shifting rate ceilings used by Prosper. We also indentified CreditScore as the most important factor in determining Proper’s rates. Of course ProsperRating is an even more effective predictor, but we wanted to imagine that we did not have access to internal Prosper data.
Our failure to model the rates that Prosper offers to its lenders suggests several additional courses of action for the future. For one we should spend considerably more time increasing our domain knowledge of Prosper and of consumer lending in general. Second, we should spend more time considering the many other variables that are available from Prosper to determine which of those, if any, can improve our model. Finally it would be of value to consider data provided by other peer to peer lenders (if such data exist) such as LendingTree to see if their rates are similar to those of Proper.
The most important question for the future